Updates

Recap

Data

Rectangular data

  • Rectangular data refers to a data structure where information is organized into rows and columns.
    • Each row represents an observation or instance of the data.
    • Each column represents a variable or feature of the data.
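
As a minimal sketch of the row/column logic described above, the following base R snippet builds a small data frame. The object name `people` and the three example rows are chosen here for illustration only.

```r
# A rectangular data set: one row per observation, one column per variable
people <- data.frame(
  name   = c("John", "Julia", "Jack"),
  age    = c(33, 32, 6),
  gender = c("male", "female", "male"),
  stringsAsFactors = FALSE
)

nrow(people)   # number of observations (rows)
ncol(people)   # number of variables (columns)
people[2, ]    # one observation: all variables for the second row
people$age     # one variable: the values of all observations
```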


Non-rectangular data

  • Hierarchical data (XML, HTML, JSON)
  • Time series data
  • Unstructured text data
  • Image data

Data

Rectangular data

  • CSV (typical for rectangular/table-like data) and variants of CSV (tab-delimited, fixed-width, etc.)
  • Excel spreadsheets (.xls)
  • Formats specific to statistical software (SPSS: .sav, Stata: .dta, etc.)
  • Built-in R datasets
  • Binary formats


Non-rectangular data

  • XML and JSON (useful for complex/high-dimensional data sets)
  • HTML (a markup language to define the structure and layout of webpages)
  • Time series
  • Text and images

Web Data, Complex Data Structures

A rectangular data set

father mother  name     age  gender
               John      33  male
               Julia     32  female
John   Julia   Jack       6  male
John   Julia   Jill       4  female
John   Julia   John jnr   2  male
               David     45  male
               Debbie    42  female
David  Debbie  Donald    16  male
David  Debbie  Dianne    12  female
What is the data about?

Which observations belong together?

Can a parser understand which observations belong together?
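
One possible way to make the grouping explicit for a parser is a nested (hierarchical) structure instead of a flat table. The sketch below stores the same families as a nested list in base R. The element names `father`, `mother`, and `children` and the object name `families` are assumptions chosen for this illustration.

```r
# Each family is one list element; its members are nested inside it
families <- list(
  list(
    father   = list(name = "John",  age = 33, gender = "male"),
    mother   = list(name = "Julia", age = 32, gender = "female"),
    children = list(
      list(name = "Jack",     age = 6, gender = "male"),
      list(name = "Jill",     age = 4, gender = "female"),
      list(name = "John jnr", age = 2, gender = "male")
    )
  ),
  list(
    father   = list(name = "David",  age = 45, gender = "male"),
    mother   = list(name = "Debbie", age = 42, gender = "female"),
    children = list(
      list(name = "Donald", age = 16, gender = "male"),
      list(name = "Dianne", age = 12, gender = "female")
    )
  )
)

# Which children belong to which family now follows from the nesting itself
child_names <- lapply(families, function(fam) {
  vapply(fam$children, function(child) child$name, character(1))
})
child_names
```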

Deciphering XML

Revisiting COVID-19 data

dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2019,continentExp,Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
14/10/2020,14,10,2020,66,0,Afghanistan,AF,AFG,38041757,Asia,1.94523087
13/10/2020,13,10,2020,129,3,Afghanistan,AF,AFG,38041757,Asia,1.81116766
12/10/2020,12,10,2020,96,4,Afghanistan,AF,AFG,38041757,Asia,1.50361089
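
For comparison with the XML version, here is a minimal sketch of how these CSV lines could be read in base R. The text is passed inline via the `text` argument rather than from a file; the object names `csv_txt` and `covid` are chosen here for illustration.

```r
# The three CSV records from the slide, including the header line
csv_txt <- "dateRep,day,month,year,cases,deaths,countriesAndTerritories,geoId,countryterritoryCode,popData2019,continentExp,Cumulative_number_for_14_days_of_COVID-19_cases_per_100000
14/10/2020,14,10,2020,66,0,Afghanistan,AF,AFG,38041757,Asia,1.94523087
13/10/2020,13,10,2020,129,3,Afghanistan,AF,AFG,38041757,Asia,1.81116766
12/10/2020,12,10,2020,96,4,Afghanistan,AF,AFG,38041757,Asia,1.50361089
"

# Each line becomes a row, each comma-separated field a column
covid <- read.csv(text = csv_txt, stringsAsFactors = FALSE)

# Dates arrive as text in day/month/year order and need explicit parsing
covid$dateRep <- as.Date(covid$dateRep, format = "%d/%m/%Y")

str(covid)
sum(covid$cases)
```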

Revisiting COVID-19 data (in XML!)

<records>
<record>
<dateRep>14/10/2020</dateRep>
<day>14</day>
<month>10</month>
<year>2020</year>
<cases>66</cases>
<deaths>0</deaths>
<countriesAndTerritories>Afghanistan</countriesAndTerritories>
<geoId>AF</geoId>
<countryterritoryCode>AFG</countryterritoryCode>
<popData2019>38041757</popData2019>
<continentExp>Asia</continentExp>
<Cumulative_number_for_14_days_of_COVID-19_cases_per_100000>1.94523087</Cumulative_number_for_14_days_of_COVID-19_cases_per_100000>
</record>
<record>
<dateRep>13/10/2020</dateRep>

...
</records>

What features does the format have? What is its logic/syntax?

XML syntax

The actual content, which we know from the CSV example above, is nested between the ‘records’ tags:

  <records>
...
  </records>
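
A minimal sketch of this nesting, using the `xml2` package that is used later in these slides. The inline document is a shortened stand-in for the COVID-19 example: only four of the variables are kept per record to keep the snippet short, and the object names are chosen here for illustration.

```r
library(xml2)

# Shortened version of the COVID-19 XML (subset of variables per record)
xml_txt <- '<records>
  <record>
    <dateRep>14/10/2020</dateRep>
    <cases>66</cases>
    <deaths>0</deaths>
    <countriesAndTerritories>Afghanistan</countriesAndTerritories>
  </record>
  <record>
    <dateRep>13/10/2020</dateRep>
    <cases>129</cases>
    <deaths>3</deaths>
    <countriesAndTerritories>Afghanistan</countriesAndTerritories>
  </record>
</records>'

doc <- read_xml(xml_txt)
xml_name(doc)                 # the root node: "records"
records <- xml_children(doc)  # one node per <record>

# Turn each record into a named list: tag names become variable names
rows <- lapply(records, function(rec) {
  fields <- xml_children(rec)
  setNames(as.list(xml_text(fields)), xml_name(fields))
})

# Stack the records into a rectangular data frame
covid <- do.call(rbind, lapply(rows, as.data.frame, stringsAsFactors = FALSE))
covid$cases <- as.integer(covid$cases)
covid
```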

XML syntax: Temperature Data example

There are two principal ways to link variable names to values.

    <variable>Monthly Surface Clear-sky Temperature (ISCCP) (Celsius)</variable>
    <filename>ISCCPMonthly_avg.nc</filename>
    <filepath>/usr/local/fer_data/data/</filepath>
    <badflag>-1.E+34</badflag>
    <subset>48 points (TIME)</subset>
    <longitude>123.8W(-123.8)</longitude>
    <latitude>48.8S</latitude>
    <case date="16-JAN-1994" temperature="9.200012" />
    <case date="16-FEB-1994" temperature="10.70001" />
    <case date="16-MAR-1994" temperature="7.5" />
    <case date="16-APR-1994" temperature="8.100006" />

XML syntax

  1. Define opening and closing XML-tags with the variable name and surround the value with them, such as in <filename>ISCCPMonthly_avg.nc</filename>.
  2. Encapsulate the values within one tag by defining tag-attributes such as in <case date="16-JAN-1994" temperature="9.200012" />.

XML syntax

Attributes-based:

    <case date="16-JAN-1994" temperature="9.200012" />
    <case date="16-FEB-1994" temperature="10.70001" />
    <case date="16-MAR-1994" temperature="7.5" />
    <case date="16-APR-1994" temperature="8.100006" />
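
A sketch of how attribute-based values could be extracted with `xml2`. The surrounding `<cases>` root element is an assumption added here so that the snippet is a complete XML document; the slide shows only the `<case />` lines.

```r
library(xml2)

# <cases> root added for illustration; the <case /> lines are from the slide
xml_txt <- '<cases>
  <case date="16-JAN-1994" temperature="9.200012" />
  <case date="16-FEB-1994" temperature="10.70001" />
  <case date="16-MAR-1994" temperature="7.5" />
  <case date="16-APR-1994" temperature="8.100006" />
</cases>'

doc <- read_xml(xml_txt)
cases <- xml_find_all(doc, ".//case")

# Values live in the attributes, so they are read with xml_attr()
temps <- data.frame(
  date        = xml_attr(cases, "date"),
  temperature = as.numeric(xml_attr(cases, "temperature")),
  stringsAsFactors = FALSE
)
temps
```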

XML syntax

Tag-based:

  <cases>
    <case>
      <date>16-JAN-1994</date>
      <temperature>9.200012</temperature>
    </case>
    <case>
      <date>16-FEB-1994</date>
      <temperature>10.70001</temperature>
    </case>
    <case>
      <date>16-MAR-1994</date>
      <temperature>7.5</temperature>
    </case>
    <case>
      <date>16-APR-1994</date>
      <temperature>8.100006</temperature>
    </case>
  </cases>
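
With the tag-based layout, the values are the text content of child nodes rather than attributes. A minimal sketch with `xml2`, using a well-formed version of the temperature example (closing tags written as `</date>`, `</temperature>`, and so on); object names are chosen here for illustration.

```r
library(xml2)

xml_txt <- '<cases>
  <case><date>16-JAN-1994</date><temperature>9.200012</temperature></case>
  <case><date>16-FEB-1994</date><temperature>10.70001</temperature></case>
  <case><date>16-MAR-1994</date><temperature>7.5</temperature></case>
  <case><date>16-APR-1994</date><temperature>8.100006</temperature></case>
</cases>'

doc <- read_xml(xml_txt)

# Values are node content, so they are read with xml_text()
temps_tags <- data.frame(
  date        = xml_text(xml_find_all(doc, ".//case/date")),
  temperature = as.numeric(xml_text(xml_find_all(doc, ".//case/temperature"))),
  stringsAsFactors = FALSE
)
temps_tags
```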

Insights: CSV vs. XML

  • XML files can represent much more complex (multi-dimensional) data than CSV files.
  • Self-explanatory syntax: machine-readable and human-readable.
  • Tags are part of the syntax: they give the data its structure and name the variables.

Deciphering JSON

JSON syntax

  • Key difference to XML: no tags, but attribute-value pairs.
  • A substitute for XML (often encountered in similar usage domains).

XML:

<person>
  <firstName>John</firstName>
  <lastName>Smith</lastName>
  <age>25</age>
  <address>
    <streetAddress>21 2nd Street</streetAddress>
    <city>New York</city>
    <state>NY</state>
    <postalCode>10021</postalCode>
  </address>
  <phoneNumber>
    <type>home</type>
    <number>212 555-1234</number>
  </phoneNumber>
  <phoneNumber>
    <type>fax</type>
    <number>646 555-4567</number>
  </phoneNumber>
  <gender>
    <type>male</type>
  </gender>
</person>

JSON:

{"firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumber": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "fax",
      "number": "646 555-4567"
    }
  ],
  "gender": {
    "type": "male"
  }
}
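
To illustrate how JSON objects and arrays map to R structures, the following sketch builds part of the example as a nested R list and serialises it with `jsonlite`. Named lists become JSON objects and unnamed lists become JSON arrays. The address is shortened to two fields here, and the object names are chosen for illustration.

```r
library(jsonlite)

# Named list -> JSON object; unnamed list of lists -> JSON array of objects
person <- list(
  firstName = "John",
  lastName  = "Smith",
  age       = 25,
  address   = list(city = "New York", state = "NY"),
  phoneNumber = list(
    list(type = "home", number = "212 555-1234"),
    list(type = "fax",  number = "646 555-4567")
  )
)

# auto_unbox = TRUE writes length-one vectors as scalars instead of arrays
json_txt <- toJSON(person, auto_unbox = TRUE, pretty = TRUE)
cat(json_txt)

# Parsing the text again recovers the same values
back <- fromJSON(json_txt)
back$address$city
```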

XML:

<person>
  <firstName>John</firstName>
  <lastName>Smith</lastName>
 
</person>

JSON:

{"firstName": "John",
  "lastName": "Smith"
}

HTML: Websites

HTML: Code to build webpages

HTML documents contain data!

HTML documents: code and data!

HTML documents/webpages consist of ‘semi-structured data’:

  • A webpage can contain an HTML table (structured data)…
  • …but likely also contains just raw text (unstructured data).

     <!DOCTYPE html>

     <html>
         <head>
             <title>hello, world</title>
         </head>
         <body>
             <h2> hello, world </h2>
         </body>
     </html>
Similarities to other formats?
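
One similarity is that HTML, like XML, is built from nested tags and can be parsed into a tree. A minimal sketch, assuming the `xml2` package and passing the page above as an inline string (the object names are chosen here for illustration):

```r
library(xml2)

html_txt <- '<!DOCTYPE html>
<html>
  <head>
    <title>hello, world</title>
  </head>
  <body>
    <h2> hello, world </h2>
  </body>
</html>'

page <- read_html(html_txt)

xml_name(page)                 # root node of the tree
xml_name(xml_children(page))   # its children: head and body

# The same XPath-based extraction used for XML also works on HTML
page_title <- xml_text(xml_find_first(page, ".//title"))
heading    <- trimws(xml_text(xml_find_first(page, ".//h2")))
page_title
heading
```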

HTML document as a ‘tree’

Two ways to read a webpage into R

Text as Data

Handling text data for analysis

Data structure: text corpus

Working with text data in R: Quanteda

Image Data

Basic data structures

  • Raster images: a matrix of pixels, where each cell stores the color of that pixel.
  • Vector-based images: text files that store the coordinates of points on a surface and how these points are connected (or not) by lines.
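
The two structures can be sketched in base R as follows. The 2 x 2 example image, the triangle coordinates, and the helper `pixel_colour()` are made up for illustration; SVG is used as one example of a text-based vector format.

```r
# Raster: a 2 x 2 pixel grid with three colour channels (red, green, blue),
# each value between 0 and 1
img <- array(0, dim = c(2, 2, 3))
img[1, 1, ] <- c(1, 0, 0)  # top-left pixel: red
img[1, 2, ] <- c(0, 1, 0)  # top-right pixel: green
img[2, 1, ] <- c(0, 0, 1)  # bottom-left pixel: blue
img[2, 2, ] <- c(1, 1, 1)  # bottom-right pixel: white

# Hypothetical helper: colour of one pixel as a hex code
pixel_colour <- function(img, row, col) {
  rgb(img[row, col, 1], img[row, col, 2], img[row, col, 3])
}
pixel_colour(img, 1, 1)

# Vector: coordinates of points plus a rule for how they are connected,
# written out as text (here: an SVG polygon)
pts <- data.frame(x = c(10, 90, 50), y = c(90, 90, 10))
points_txt <- paste(pts$x, pts$y, sep = ",", collapse = " ")
svg_txt <- sprintf(
  '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100"><polygon points="%s" /></svg>',
  points_txt
)
cat(svg_txt)
```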

Raster images

Raster and vector images in R

Use cases in economic research and beyond

  • Extract text from historical documents (scan, use OCR)
  • Use machine learning to label text (too costly to do manually)
  • Extract information from maps

Importing Web Data Formats

XML in R

# load packages
library(xml2)

# parse XML, represent XML document as R object
xml_doc <- read_xml("data/customers.xml")
xml_doc
## {xml_document}
## <customers>
## [1] <person>\n  <name>John Doe</name>\n  <orders>\n    <product> x </product>\n    <product> y </ ...
## [2] <person>\n  <name>Peter Pan</name>\n  <orders>\n    <product> a </product>\n    <product> x < ...
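
To try this without the file, `read_xml()` also accepts the XML as a string. The document below is an illustrative stand-in for `data/customers.xml`: the tag names and customer names follow the printed output above, but the output is truncated, so the actual file may contain more products than shown here.

```r
library(xml2)

# Illustrative stand-in for data/customers.xml (real file may differ)
customers_txt <- '<customers>
  <person>
    <name>John Doe</name>
    <orders>
      <product> x </product>
      <product> y </product>
    </orders>
  </person>
  <person>
    <name>Peter Pan</name>
    <orders>
      <product> a </product>
      <product> x </product>
    </orders>
  </person>
</customers>'

# parse the string instead of a file path
xml_example <- read_xml(customers_txt)

xml_name(xml_example)    # name of the root node
xml_length(xml_example)  # number of child nodes below the root
```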

XML in R: tree-structure

‘customers’ is the root node; the ‘person’ nodes are its children:

# navigate downwards
persons <- xml_children(xml_doc) 
persons
## {xml_nodeset (2)}
## [1] <person>\n  <name>John Doe</name>\n  <orders>\n    <product> x </product>\n    <product> y </ ...
## [2] <person>\n  <name>Peter Pan</name>\n  <orders>\n    <product> a </product>\n    <product> x < ...

XML in R: tree-structure

Navigate sideways and upwards

# navigate sideways
persons[1]
## {xml_nodeset (1)}
## [1] <person>\n  <name>John Doe</name>\n  <orders>\n    <product> x </product>\n    <product> y </ ...
xml_siblings(persons[[1]])
## {xml_nodeset (1)}
## [1] <person>\n  <name>Peter Pan</name>\n  <orders>\n    <product> a </product>\n    <product> x < ...
# navigate upwards
xml_parents(persons)
## {xml_nodeset (1)}
## [1] <customers>\n  <person>\n    <name>John Doe</name>\n    <orders>\n      <product> x </product ...

XML in R: tree-structure

Extract specific parts of the data:

# find data via XPath
customer_names <- xml_find_all(xml_doc, xpath = ".//name")
# extract the data as text
xml_text(customer_names)
## [1] "John Doe"  "Peter Pan"
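
XPath expressions can also filter nodes. The sketch below selects the products of one customer and then flattens all orders into a rectangular table. It runs on an inline stand-in document whose content is assumed for illustration (the real `data/customers.xml` may contain more products); `example_doc`, `peter_products`, and `orders` are names chosen here.

```r
library(xml2)

# Illustrative stand-in for the customers document
example_doc <- read_xml('<customers>
  <person>
    <name>John Doe</name>
    <orders><product> x </product><product> y </product></orders>
  </person>
  <person>
    <name>Peter Pan</name>
    <orders><product> a </product><product> x </product></orders>
  </person>
</customers>')

# XPath predicate: only products ordered by the person named "Peter Pan"
peter_nodes <- xml_find_all(
  example_doc, ".//person[name='Peter Pan']/orders/product"
)
peter_products <- trimws(xml_text(peter_nodes))
peter_products

# One row per customer-product pair: a rectangular view of the nested data
persons <- xml_find_all(example_doc, ".//person")
orders <- do.call(rbind, lapply(persons, function(p) {
  data.frame(
    customer = xml_text(xml_find_first(p, "./name")),
    product  = trimws(xml_text(xml_find_all(p, "./orders/product"))),
    stringsAsFactors = FALSE
  )
}))
orders
```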

JSON in R

# load packages
library(jsonlite)

# parse the JSON-document shown in the example above
json_doc <- fromJSON("data/person.json")

# look at the structure of the document
str(json_doc)
## List of 6
##  $ firstName  : chr "John"
##  $ lastName   : chr "Smith"
##  $ age        : int 25
##  $ address    :List of 4
##   ..$ streetAddress: chr "21 2nd Street"
##   ..$ city         : chr "New York"
##   ..$ state        : chr "NY"
##   ..$ postalCode   : chr "10021"
##  $ phoneNumber:'data.frame': 2 obs. of  2 variables:
##   ..$ type  : chr [1:2] "home" "fax"
##   ..$ number: chr [1:2] "212 555-1234" "646 555-4567"
##  $ gender     :List of 1
##   ..$ type: chr "male"

JSON in R

The nesting structure is represented as a nested list:

# navigate the nested lists, extract data
# extract the address part
json_doc$address
## $streetAddress
## [1] "21 2nd Street"
## 
## $city
## [1] "New York"
## 
## $state
## [1] "NY"
## 
## $postalCode
## [1] "10021"
# extract the gender (type)
json_doc$gender$type
## [1] "male"
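
Arrays of objects, such as `phoneNumber`, are simplified to a data frame by `fromJSON()`, so they can be filtered like any other table. A self-contained sketch that parses the JSON example from an inline string instead of a file (the object names `json_txt` and `person` are chosen here for illustration):

```r
library(jsonlite)

json_txt <- '{"firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumber": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "fax", "number": "646 555-4567"}
  ],
  "gender": {"type": "male"}
}'

# fromJSON() accepts a JSON string as well as a file path
person <- fromJSON(json_txt)

# the array of phone numbers is a data frame with one row per entry
person$phoneNumber

# extract the fax number by filtering on the type column
fax_number <- person$phoneNumber$number[person$phoneNumber$type == "fax"]
fax_number
```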

Tutorial (advanced): Importing data from an HTML table

-> Exercise session next week

Q&A